Abstract. We describe a model-based motion filtering process that, when applied
to human arm motion data, leads to improved arm gesture recognition. Arm
movements can be viewed as responses to muscle actuations that are guided by
responses of the nervous system. Our motion filtering method makes strides
towards capturing this structure by integrating a dynamic model with a control
system for the arm. We hypothesize that embedding human performance knowledge
into the processing of arm movements will lead to better recognition performance.

We present details of the design of our filter and of our evaluation of the
filter in both expert-user and multiple-user pilot studies. Our results show
that the filter has a positive impact on recognition performance for arm
gestures.

1.  Introduction 

Gesture recognition techniques have been studied extensively in recent years
because of their potential for application in user interfaces. It has long been
a goal to apply the “natural” communication means that humans employ with each
other to the interfaces of computers. People commonly use arm and hand gestures,
ranging from simple actions of “pointing” to more complex gestures that express
their feelings and enhance communication. Having the ability to recognize arm
gestures by computer would create many possibilities to improve application
interfaces, especially those requiring difficult data manipulations (e.g., 3D
transformations). Pointing operations would certainly be an effective means to
infer directional information such as where to move an object in a computer
environment. To date no method has been found for arm gesture recognition that
is both very accurate and extendable to broad sets of gestures. Typical
approaches (e.g., HMMs, neural networks) have focused on applying analytical
methods for breaking down motion sequences and recognizing patterns.

The human model-based approach takes into consideration that while a person is
making gestures, the resulting motions and poses are played out by a known,
rather than an unknown, process. The gestures can be viewed as responses of a
skeletal frame to muscle actuations that are made in response to control signals
originating in the nervous system. The structure of the skeleton, joints, and
musculature is well known and well studied. The neural control systems that
actuate the muscles are becoming better understood. With a solid model of human
dynamics and control, much of the analytical heuristic guesswork might be
eliminated. The arm is a good subject for testing model-based approaches because
it is an articulated structure with well understood musculature and fairly large
inertias that must have a significant effect on gesture performance.

We have designed a motion adaptation filter that integrates both physical and
control models of human gesture in order to enhance the signal leading to the
gesture recognizer. Our technique uses two motion filters: one augmented with a
“learned” parametric gesture sequence and control system, and the other
unaugmented. Our method for incorporating process knowledge (the model and its
dynamics) is the extended Kalman filter, though any process estimation filter
that can handle non-linearities could be used. The squared difference between
the outputs of the two filters is computed and normalized, giving a score that
can be used by the recognition system.

Our working hypothesis is that the motion adaptation filter will improve the
unknown signal’s quality enough to improve or simplify the recognition process.
We tested the hypothesis by integrating the filter with a simple template gesture
recognition system, although our filter can be integrated with any standard type
of gesture recognition system. To determine the impact that our filter has on
arm-movement recognition performance, we tested the system with an expert user
performing multiple sets of gestures and with a multiple-user pilot study.

2.  Related  Work 

Here  we  briefly  describe  the  most  common  recognition  methods  and  previous 
related  work  utilizing  human  model-based  approaches.  More  complete  details 
can  be  found  in  surveys  by  Watson  [1],  Aggarwal  and  Cai  [2],  Pavlovic  et  al.  [3] 
and  our  technical  report  [4]. 

2.1.  Overview  of  Recognition  Methodologies 

The common methodologies that have been used for motion and gesture recognition
are: (1) template matching [1], (2) feature-based [1], (3) statistical [5], [6]
and (4) multimodal probabilistic combination [7]. By far the most popular
recognition methods are feature-based neural networks (e.g., [8], [9], [10]) and
statistical hidden Markov models (HMMs) (e.g., [11], [12], [13]). Each approach
has drawbacks that either affect performance or limit usability. One of the major
drawbacks is that most depend on user-specific training and parameter tuning.

The  template  approach  compares  the  unclassified  input  sequence  with  a  set 
of  predefined  template  patterns.  The  algorithm  requires  preliminary  work  to 
generate  a  set  of  gesture  patterns,  and  usually  has  poor  performance  due  to 
the  difficulty  of  spatially  and  temporally  aligning  the  input  with  the  template 
patterns [1].

The neural network approach typically uses a pre-determined set of common
discriminating features, estimates covariances during a training process, and
uses a discriminator (e.g., the classic linear discriminator [14]) to classify
gestures. The drawback of this method is that features are manually selected and
time-consuming training is involved [1].

The  HMM  method  is  a  variant  of  a  finite  state  machine  characterized  by 
a  set  of  states,  a  set  of  observation  symbols  for  each  state,  and  probability 
distributions  for  state  transitions,  observation  symbols  and  initial  states  [5].  The 
major  drawbacks  of  the  HMMs  are:  (1)  they  require  a  set  of  training  gestures 
to  generate  the  state  transition  network  and  tune  parameters;  (2)  they  make 
the  assumption  that  successive  observed  operations  are  independent,  which  is 
typically  not  the  case  with  human  motion  [15]. 

In  a  multimodal  recognition  process,  two  or  more  human  senses  are  captured 
and/or  two  or  more  capturing  technologies  are  combined.  The  multiple  inputs 
are  processed  by  a  classifier,  which  rates  the  set  of  possible  output  patterns  with 
a  value  based  upon  the  likelihood  of  a  match.  The  set  of  probabilities  for  each 
input  are  then  combined  in  a  manner  to  be  able  to  select  the  most  likely  pattern. 
Many  groups  have  explored  combining  speech  and  gesture  (e.g.,  Cohen  et  al.  [7], 
Vo  and  Waibel  [16]). 

2.2.  Methods  Utilizing  Human  Model-Based  Approaches 

Human model-based approaches integrate a model of human motion, typically
approximated as a dynamic process and control system, into the process of
filtering motion capture data of human movements. Such a model-based approach
seems to have first appeared in Pentland and Horowitz [17]. Model-based
approaches to motion generation for animation have been utilized by Zordan and
Hodgins [18], Metaxas [19] and others. Wren and Pentland [20] applied dynamics
to a 3D skeletal model for a tracking application. They applied 2D measurements
from image features and combined them with the extended Kalman filter to drive
the 3D model. Their resulting tracking system was able to tolerate temporary
image occlusions and the presence of multiple people in the tracked area. In more
recent work [21] they explored the notion that people utilize muscles to actively
shape purposeful motion. In earlier work [22], we explored the use of a simple
particle model for arm motion recognition.

3.  Background 

Here  we  give  the  background  for  methods  that  we  utilized  and  integrated  in  the 
design  of  our  filter. 

3.1.  Extended  Kalman  Filter 

The  extended  Kalman  filter  (EKF)  [23]  estimates  both  the  time  sequence  of 
states  of  an  input  data  stream  and  a  statistical  model  of  that  data  stream. 
The  EKF  differs  from  the  standard  Kalman  filter  [24]  in  that  it  can  be  used  to 
estimate  a  process  that  is  non-linear  and/or  handle  a  measurement  relationship 
to  the  process  that  is  non-linear.  The  EKF  can  be  augmented  by  a  dynamic 
model  of  the  system  being  tracked,  and  knowledge  of  the  reliability  of  this  model. 
Simply  described,  the  filter  is  a  set  of  time  update  equations  that  estimate  the 
next state vector, current error covariance and the Kalman gain. The Kalman
gain affects the weighting of measurement data versus the control model in
determining  the  next  state  vector  estimate.  If  the  dynamic  model  is  left  out  or 
is  unreliable,  the  Kalman  gain  is  high  and  the  filter  simply  smoothes  the  input 
data. 

The EKF’s prediction equations may be written

\[
\begin{aligned}
\hat{x}^-_{i+1} &= f(\hat{x}_i, u_i, 0), \\
P^-_{i+1} &= A_i P_i A_i^T + W_i Q_i W_i^T,
\end{aligned}
\tag{1}
\]

where $f$ estimates the a priori state vector $\hat{x}^-_{i+1}$ as a function of
the current state vector $\hat{x}_i$ and the process model vector $u_i$ at the
current time step. $P_i$ and $P^-_{i+1}$ are the current and a priori estimated
error covariances, $Q_i$ is the process model error covariance, and $A$ and $W$
are the Jacobians of $f$ with respect to the state $x$ and a vector of random
variables $w$.

The filter’s update equations may be written

\[
\begin{aligned}
K_i &= P^-_i H_i^T \left( H_i P^-_i H_i^T + V_i R_i V_i^T \right)^{-1}, \\
\hat{x}_i &= \hat{x}^-_i + K_i \left( z_i - h(\hat{x}^-_i, 0) \right), \\
P_i &= (I - K_i H_i) P^-_i,
\end{aligned}
\tag{2}
\]

where $K_i$ is the current Kalman gain matrix, $v$ is a vector of random
variables, $h$ relates the state vector to the measurement vector $z_i$, $R_i$
is the measurement error covariance, and $H$ and $V$ are the Jacobians of $h$
with respect to $x$ and $v$.
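
As an informal illustration of Equations 1 and 2, the following Python sketch runs one predict/update cycle for a scalar state, so that every matrix reduces to a number. The function name `ekf_step`, the toy process model `f`, the identity measurement model `h` and all numeric values are assumptions made for this example only; they are not the arm model used by our filter.

```python
import math

def ekf_step(x, P, u, z, f, h, A, W, H, V, Q, R):
    """One scalar EKF cycle: Equation 1 (predict), then Equation 2 (update).

    f(x, u) and h(x) are the noise-free process and measurement models.
    A, W, H, V are callables returning the scalar Jacobians at a given state.
    """
    # Prediction (Equation 1): a priori state and error covariance.
    a, w = A(x, u), W(x, u)
    x_prior = f(x, u)
    P_prior = a * P * a + w * Q * w
    # Update (Equation 2): gain, a posteriori state and error covariance.
    hj, v = H(x_prior), V(x_prior)
    K = P_prior * hj / (hj * P_prior * hj + v * R * v)
    x_post = x_prior + K * (z - h(x_prior))
    P_post = (1.0 - K * hj) * P_prior
    return x_post, P_post, K

# Hypothetical toy models, chosen only so that the process is non-linear.
f = lambda x, u: x + 0.1 * math.sin(x) + u
h = lambda x: x
A = lambda x, u: 1.0 + 0.1 * math.cos(x)   # df/dx
W = lambda x, u: 1.0                        # df/dw for additive noise
H = lambda x: 1.0                           # dh/dx
V = lambda x: 1.0                           # dh/dv for additive noise

x, P, K = ekf_step(x=0.5, P=1.0, u=0.0, z=0.7,
                   f=f, h=h, A=A, W=W, H=H, V=V, Q=0.01, R=0.1)
```

With a large prior covariance relative to the measurement noise, the gain is close to one and the estimate moves most of the way from the prediction towards the measurement, which is the "smoothing" behaviour described above.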

3.2.  Lagrangian  Formulation  for  Dynamics 

The Lagrangian formulation for dynamics is particularly appropriate for
articulated systems. The Lagrangian

\[
L(q, \dot{q}) = E_k(q, \dot{q}) - E_p(q)
\tag{3}
\]

is the difference between the kinetic energy $E_k$ and potential energy $E_p$ of
the system as a function of state $q$. The state is a set of generalized joint
coordinates and its rate $\dot{q}$ is a set of related velocities. The Lagrangian
formulation for the dynamics of a system is

\[
\frac{d}{dt}\frac{\partial L}{\partial \dot{q}_i} - \frac{\partial L}{\partial q_i} = \tau_i,
\qquad i = 1, \ldots, n,
\tag{4}
\]

where $\tau$ is the set of externally applied or nonconservative forces and
torques [25].

Solutions  to  Equation  4  can  be  found  in  closed  form,  which  are  more  efficient 
and  readily  parameterizable  than  the  open  form  derivations  generated  by  the 
Featherstone  algorithm  [26],  which  is  a  very  efficient  rendition  of  the  Newton- 
Euler  approach  to  dynamics  [27].  On  the  other  hand,  the  open  form  derivations 
do  have  the  advantage  that  they  can  be  easily  extended  to  handle  large  sets  of 
joint-space  configurations. 
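
As a quick worked check of Equations 3 and 4 (our own illustration, not part of the arm model), consider a single link with a 1-DOF joint, whose mass $m$ is assumed to be concentrated at a distance $r$ from the joint, with the joint angle $\theta$ measured from the downward vertical:

```latex
\begin{align*}
E_k &= \tfrac{1}{2} m r^2 \dot{\theta}^2, \qquad E_p = -m g r \cos\theta, \\
L   &= \tfrac{1}{2} m r^2 \dot{\theta}^2 + m g r \cos\theta, \\
\frac{d}{dt}\frac{\partial L}{\partial \dot{\theta}} - \frac{\partial L}{\partial \theta}
    &= m r^2 \ddot{\theta} + m g r \sin\theta = \tau .
\end{align*}
```

Solving for the acceleration gives the closed-form forward dynamics for this one-joint case, $\ddot{\theta} = (\tau - m g r \sin\theta) / (m r^2)$.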


4.  Motion  Adaptation  Filter 

The design of our model-based motion adaptation filter is shown in Figure 1.
Its two extended Kalman filters each contain a model of the human arm and its
dynamics. Only one is augmented with a model of a control system acting on the
arm. The unknown input motion sequence is passed through each filter, the two
outputs are compared, and a score is computed, which is the output of the motion
adaptation filter.

Fig. 1. Motion Adaptation Filter

Fig. 2. Five Gestures Influenced by an Arc Motion Sequence

The  unaugmented  filter  simply  smoothes  the  input  motion  sequence.  Since 
it  contains  a  control  system,  the  augmented  filter  attempts  to  influence  the  raw 
input  motion  sequence  to  follow  a  learned  motion  sequence.  We  illustrate  this 
notion  in  Figure  2  by  showing  five  different  motion  sequences  (arc,  line,  wave, 
circle  and  angle)  as  influenced  by  a  control  system  generating  an  arc.  Each 
sequence  starts  on  the  right  side  and  proceeds  towards  the  left.  The  darkest 
grey  line  indicates  the  “influencing”  arc  sequence,  the  lightest  grey  is  the  input 
sequence,  and  the  mid-grey  is  the  output  sequence.  The  images  show  the  degree 
of  influence  that  the  arc  controller  has  on  each  of  the  input  sequences.  The  degree 
of  this  influence  is  determined  by  the  Kalman  gain. 

The unaugmented and augmented filters both contain units for motion state
estimation and dynamics update. The state estimation unit blends the input
motion sequence with the current state vector and passes the data to the
dynamics update process. There, forward dynamics are performed on the state
vector, producing angular accelerations. These are numerically integrated,
generating the next state vector. The next state vector is fed back into the
system at the Kalman blend and sent to be compared with the output from the
augmented filter. The Kalman gain is updated from the current error covariance,
which is subsequently updated by data from the dynamics update process.

Fig. 3. Articulated Arm Model

The augmented filter’s control system is composed of a driving torque controller
and a blending function. Torques used by the controller are derived from the
parametric learned motion sequence and model and applied to the forward dynamics
of the system. After numerical integration, an intermediate state vector is
passed to the blending function, where it is mixed with the aligned and
parameterized learned motion sequence, producing the next state vector. The
motivation behind the augmented filter is that if the input motion sequence
closely matches the learned motion sequence (e.g., the arc passed through the
arc module in Figure 2), then the resulting trajectory should be very similar to
the input. Thus the trajectories output by the unaugmented and augmented filters
will be nearly identical, and the output score will be small. However, if the
input motion sequence is dissimilar to the learned sequence (e.g., the line
passed through the arc module in Figure 2), the trajectories will differ greatly
and the score will be large.

4.1. Arm Model

A  dynamic  articulated  model  of  a  human  arm  is  integrated  into  the  filter.  The 
arm  model  consists  of  a  3-DOF  shoulder  joint,  a  1-DOF  elbow  joint  and  cylinder 
linkages  between  the  shoulder  and  elbow,  and  between  the  elbow  and  wrist.  The 
model  is  shown  in  Figure  3.  We  ignore  the  wrist  twist  in  the  lower  arm.  We  also 
capture  the  three  degrees  of  freedom  for  the  torso,  which  is  used  to  produce  a 
relative  coordinate  system  for  the  arm.  The  three  degrees  of  freedom  from  the 
torso  are  eliminated  after  the  coordinate  transformation  takes  place  between  the 
torso  and  shoulder. 

The position of the wrist and elbow can be determined by using the kinematics
equations of motion for the arm model. The equations are parameterized using
joint angles for each degree of freedom of the joints in the model. They are

\[
\begin{aligned}
x_e &= (-l_U S_\theta C_\phi,\; -l_U S_\theta S_\phi,\; -l_U C_\theta)^T, \\
x_w &= R_z(\phi) R_y(\theta) (-l_L S_\rho C_\alpha,\; -l_L S_\rho S_\alpha,\; -l_L C_\rho)^T,
\end{aligned}
\tag{5}
\]

where $x_e$ and $x_w$ are the positions of the elbow and wrist, respectively,
$l_U$ and $l_L$ are the corresponding lengths of the upper and lower arm,
$R_z(\phi)$ and $R_y(\theta)$ are rotation matrices about the respective axes
$z$ and $y$, and $S$ and $C$ are sines and cosines of the angles of rotation
$\theta$, $\phi$, $\alpha$ and $\rho$.
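
The following Python sketch evaluates Equation 5 term by term. It assumes the usual right-handed rotation matrices for $R_z$ and $R_y$, and the function names are hypothetical. The second expression is evaluated exactly as it is written, i.e. as the rotated lower-arm vector; adding it to the elbow position to obtain a position relative to the shoulder would be a further assumption that the sketch does not make.

```python
import math

def rot_z(a):
    """Rotation by angle a about the z axis."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(a):
    """Rotation by angle a about the y axis."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def mat_vec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def elbow_position(theta, phi, l_u):
    """First line of Equation 5."""
    st, ct = math.sin(theta), math.cos(theta)
    return [-l_u * st * math.cos(phi), -l_u * st * math.sin(phi), -l_u * ct]

def lower_arm_term(theta, phi, alpha, rho, l_l):
    """Second line of Equation 5, evaluated as written."""
    sr, cr = math.sin(rho), math.cos(rho)
    local = [-l_l * sr * math.cos(alpha), -l_l * sr * math.sin(alpha), -l_l * cr]
    return mat_vec(rot_z(phi), mat_vec(rot_y(theta), local))
```

A simple consistency check on the conventions: with the elbow fully extended ($\rho = 0$), the lower-arm term points in the same direction as the elbow position, scaled by $l_L / l_U$.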

4.2.  Motion  State  Estimation 

Motion state estimation is used to predict the state vector at the next time step
from the current state of the measured input, the dynamic model and statistical
models of the measurement and control systems. The statistics for the
measurement process and control system are in the form of error covariance
matrices and are predetermined using training and measurements from the user
workspace. They are used by the EKF along with data from the dynamics update
process to determine the current Kalman gain.

The Kalman gain is critical for state estimation in the system and requires
knowledge from the dynamics and measurement processes. These data include the
four (8x8)-Jacobian matrices $A$, $W$, $H$ and $V$ from Equations 1 and 2, which
relate the process and measurement system’s state vectors to the current state
vector. The analytic equations for the elements of these matrices are
predetermined and their values updated as the filter operates. For the
measurement Jacobians, $H = V = I$, where $I$ is the 8x8 identity matrix. The
matrices $A$ and $W$ are updated by taking the partial derivatives, with respect
to the current state vector, of their respective complete forward dynamics
equation $g$. The augmented and unaugmented filters have different formulations.
The formulation for the augmented filter is

\[
g(q, \dot{q}, w_1, w_2) = B'^{-1}\left[ \tfrac{1}{2} (\dot{q} + w_2)^T \frac{\partial}{\partial q}[B'] (\dot{q} + w_2)
- \dot{B}' (\dot{q} + w_2) + \tau(q^m, \dot{q}^m) \right],
\tag{6}
\]

and for the unaugmented filter is

\[
g(q, \dot{q}, w_1, w_2) = B'^{-1}\left[ \tfrac{1}{2} (\dot{q} + w_2)^T \frac{\partial}{\partial q}[B'] (\dot{q} + w_2)
- \dot{B}' (\dot{q} + w_2) \right],
\tag{7}
\]

where $w_1$ and $w_2$ are vectors of random variables representing “white” noise
with zero mean and constant variance associated with the process model’s state
vector and velocities, respectively. $B$ and $\dot{B}$ are the inertia matrices
defined in Section 4.3, composed of members from the state vector $q$ and angular
velocities $\dot{q}$. $B'$ and $\dot{B}'$ are similar matrices to $B$ and
$\dot{B}$, but wherever an element of $q$ or $\dot{q}$ appears, the appropriate
random variable from the vectors $w_1$ or $w_2$ is added to that member. For
example, if $\theta$ appears in an element of matrix $B$, then in $B'$ it is
replaced by $\theta + w_{11}$, where $w_{11}$ is the first element in the vector
$w_1$, since $\theta$ is the first element in $q$.

4.3.  Dynamics  Update 

The dynamics update process provides parameter updates for motion state
estimation and the control system. It takes the current state of the system and
the arm model (and a set of torques for the augmented filter), and performs
forward dynamics to produce the parameter update functions $g$ (described in
Section 4.2) and the angular accelerations $\ddot{q}$. Our experiments showed
that Euler numerical integration [28] was adequate for updating the next state
vector using the accelerations.
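
A minimal sketch of the explicit Euler update mentioned above is given below. The function name and the list-based representation of joint angles, velocities and accelerations are assumptions made for illustration.

```python
def euler_step(q, qd, qdd, dt):
    """Advance joint angles q and angular velocities qd by one time step dt,
    given the angular accelerations qdd from forward dynamics (explicit Euler)."""
    q_next = [qi + dt * qdi for qi, qdi in zip(q, qd)]
    qd_next = [qdi + dt * qddi for qdi, qddi in zip(qd, qdd)]
    return q_next, qd_next

# Two joints, one moving at constant velocity and one accelerating from rest.
q_next, qd_next = euler_step([0.0, 1.0], [1.0, 0.0], [0.0, 2.0], dt=0.5)
```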

The forward dynamics equation for the 4-DOF articulated arm model generates the
angular accelerations and is used to derive the complete forward dynamics
equations (Equations 6 and 7). In order to derive these equations, the masses,
lengths and moments of inertia of the arm segments are needed. Each arm segment
is represented by a thin cylinder rotating about its endpoint. The center of
mass for each cylinder is estimated using data from a study on anthropometric
parameters for the human body in [29]. The data gives estimations for the
segmental center of mass (COM) locations expressed in percentages of the segment
lengths. These are measured from the proximal end of the segments. The moment of
inertia for each segment is computed by combining the inertia tensor of the
representative cylinder body and the inertial component associated with the
shift of its COM to the endpoint. The inertial components associated with the
shift of the COM are

\[
\begin{aligned}
\chi_U &= (-r_U S_\theta C_\phi,\; -r_U S_\theta S_\phi,\; -r_U C_\theta)^T, \\
\chi_L &= R_z(\phi) R_y(\theta) (-r_L S_\rho C_\alpha,\; -r_L S_\rho S_\alpha,\; -r_L C_\rho)^T,
\end{aligned}
\tag{8}
\]

where $\chi_U$ and $\chi_L$ are the positions in Cartesian world space of the
estimated COMs of the upper and lower arm, respectively, and $r_U$ and $r_L$ are
the corresponding radial distances from the shoulder and elbow, respectively.
Time derivatives are taken to get the angular velocities at the estimated COMs
of the arm segments. These are

\[
\dot{\chi}_i = J_i \dot{q}, \qquad i \in \{U, L\},
\tag{9}
\]

where the Jacobian matrices are $J_U = \partial \chi_U / \partial q$ and
$J_L = \partial \chi_L / \partial q$, and $q = (\theta, \phi, \alpha, \rho)^T$.

The inertial components are

\[
\begin{aligned}
I_U &= m_U J_U^T J_U + I_{body_U}, \\
I_L &= m_L J_L^T J_L + I_{body_L},
\end{aligned}
\tag{10}
\]

where $I_U$ and $I_L$ are the inertial components of the upper and lower arm,
respectively, $m_U$ and $m_L$ are the estimated masses of the arm segments, and
$I_{body_U}$ and $I_{body_L}$ are diagonal matrices representing the thin
cylinder body inertias about each of the parameterized axes $\theta$, $\phi$,
$\alpha$ and $\rho$. The elements in $I_{body_U}$ and $I_{body_L}$ are
determined by converting the cylinder’s Euclidean coordinates to spherical
coordinates.
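
For a single rotation axis, combining a body inertia with the shift of the COM to the endpoint reduces to the parallel-axis theorem. The sketch below shows that scalar analogue only; it is not the generalized-coordinate form of Equation 10. It assumes a uniform solid cylinder for the body term, and the function name and the `com_fraction` parameter (the COM location as a fraction of segment length from the proximal end) are hypothetical.

```python
def segment_inertia_about_endpoint(mass, length, radius, com_fraction):
    """Moment of inertia of one arm segment about a transverse axis through
    its proximal endpoint (scalar, single-axis simplification)."""
    # Transverse inertia of a uniform solid cylinder about its own COM.
    i_body = mass * (3.0 * radius ** 2 + length ** 2) / 12.0
    # Parallel-axis term for shifting the axis from the COM to the endpoint.
    d = com_fraction * length
    return i_body + mass * d ** 2

# Sanity case: a thin rod (radius 0) with its COM at the midpoint.
rod = segment_inertia_about_endpoint(mass=2.0, length=0.3, radius=0.0, com_fraction=0.5)
```

For the thin-rod sanity case the result agrees with the textbook value $m l^2 / 3$ for a rod rotating about its end.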

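The angular velocities and inertias are used to compute the kinetic energy

\[
E_k = \tfrac{1}{2} \dot{q}^T B \dot{q},
\tag{11}
\]

where $B = I_U + I_L$. The potential energy is given as

\[
E_p = -m_U g r_U C_\theta - m_L g \left[ l_U C_\theta - r_L S_\rho C_\alpha S_\theta + r_L C_\theta C_\rho \right],
\tag{12}
\]

where $g$ is the gravitational constant. The two energy terms are used for the
Lagrangian, $L$, of Equations 3 and 4. The dynamics equations are computed and
solved for angular acceleration

\[
\ddot{q} = B^{-1}\left[ \tfrac{1}{2} \dot{q}^T \frac{\partial}{\partial q}[B]\, \dot{q} - \dot{B} \dot{q} + \tau \right],
\tag{13}
\]

where $\tau$ is the set of applied torques.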

4.4.  Control  System 

Our control system acts as an analogue to the motor nervous system in the human
body, influencing how the learned motion sequence acts on the current motion
state. It is composed of a driving torque controller and a blending function.
The driving torque controller uses data from the learned motion sequence and arm
model and performs inverse dynamics, which generates torques for the dynamics
update process. The blending function combines the learned motion sequence with
an intermediate state vector from the dynamics update process. The degree of its
influence is controlled by a fixed predetermined blending factor. The learned
motion sequence also remains fixed throughout the iteration of the filter. We
see the driving torque controller as analogous to an open-loop predictive
control and the blending function as analogous to proprioceptive and sensory
feedback. Our control system has similarities to the model reference adaptive
control (MRAC) system presented in [30], [31], which incorporates a reference
model of a motion sequence, inverts its dynamics and applies the resulting
torques in a controlled manner to the input data.

The torques for the driving torque controller are computed using the inverse
dynamics torque formulation

\[
\tau(q, \dot{q}) = \tau(q^m, \dot{q}^m) + \tfrac{1}{2} \dot{q}^T \frac{\partial}{\partial q}[B]\, \dot{q} - \dot{B} \dot{q},
\tag{14}
\]

where $\tau$ is the vector of applied torques from the controller, and joint
angles $q^m$ and angular velocities $\dot{q}^m$ are from the influencing gesture
sequence. The joint configurations are transformed so that they correlate with
the learned model’s joint configurations.

Since there is no feedback in the driving torque controller, the torques can be
precomputed. When $\tau(q, \dot{q})$ is applied to the dynamics, it influences
the motion of the model to follow a trajectory analogous to the influencing
sequence. However, it does not necessarily strongly influence the raw motion
data to move towards the learned motion sequence. The strength of the influence
is controlled by a scaling parameter $k_c$ that is applied to the Kalman
filter’s process model error covariance matrix $Q$. This affects how much the
system “trusts” the raw motion data versus the dynamic model. As $k_c$ changes,
it directly impacts how the reported controller error relates to the measurement
error in the system. As a result, the Kalman filter’s gain matrix $K$
(Equation 2) stabilizes differently, thereby changing how the Kalman filter
weights input motion versus controller influence.
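
The effect of scaling $Q$ on the stabilized gain can be seen in a scalar toy case. The sketch below iterates the covariance and gain recursions of Equations 1 and 2 with all Jacobians set to one; the function name, the default values of `q` and `r`, and the iteration count are assumptions chosen only for illustration.

```python
def steady_state_gain(k_c, q=0.01, r=0.1, steps=200):
    """Iterate the scalar covariance/gain recursion until it settles and
    return the resulting Kalman gain for a process covariance of k_c * q."""
    P = 1.0
    K = 0.0
    for _ in range(steps):
        P_prior = P + k_c * q          # Equation 1 with A = W = 1
        K = P_prior / (P_prior + r)    # Equation 2 with H = V = 1
        P = (1.0 - K) * P_prior
    return K

gains = {k_c: steady_state_gain(k_c) for k_c in (0.1, 1.0, 10.0)}
```

In this toy setting a larger $k_c$ reports a less reliable process model, so the gain settles at a higher value and the filter weights the input motion more heavily relative to the model.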

The blending function supplements the driving torque controller by providing
more guidance to the state estimation. The driving torque controller provides the
dynamics drive for the model, but it does not always provide sufficient guidance.
The influencing motion sequence’s torques may be nonlinear with respect to the
joint configurations, but the tracking system performs blending of joint
configurations linearly. Therefore, due to linear blending, small changes in the
joint configurations can produce large changes in the dynamics. This directly
affects how the driving torque controller performs. The blending function is
intended to counteract this effect.

Fig. 4. Test Recognition System Architecture

The blending function incorporates the current state of the system with the raw
motion data from a learned motion sequence. The raw motion data includes the
joint angles and angular velocities. This data is linear with respect to the
motion state configurations of the system. The blending function that we use is

\[
x_{i+1} = b (x_i + \Delta t\, \dot{x}_i) + (1 - b)\, x_i^m,
\tag{15}
\]

where $x_i = [q, \dot{q}]^T$, $\dot{x}_i = [\dot{q}, \ddot{q}]^T$,
$x_i^m = [q^m, \dot{q}^m]^T$, $\Delta t$ is the current time step, and $b$ is the
blending factor.
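
Equation 15 can be sketched directly in Python. The function name and the use of plain lists for the state vectors are assumptions made for this example.

```python
def blend_state(x, x_dot, x_m, dt, b):
    """Equation 15: blend the Euler-integrated state with the learned
    motion state x_m using the blending factor b."""
    return [b * (xi + dt * xdi) + (1.0 - b) * xmi
            for xi, xdi, xmi in zip(x, x_dot, x_m)]

# b = 1 keeps only the integrated dynamics; b = 0 keeps only the learned state.
only_dynamics = blend_state([1.0, 2.0], [0.5, 0.0], [3.0, 4.0], dt=0.1, b=1.0)
only_learned = blend_state([1.0, 2.0], [0.5, 0.0], [3.0, 4.0], dt=0.1, b=0.0)
```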

5.  Analysis  of  Filter 

In order to test its effectiveness, we implemented our new filter, selected a
difficult-to-discriminate gesture dataset, and ran user studies.

5.1.  Design  of  Test  System 

We designed a system to test the motion adaptation filter by adapting a simple
template-style gesture recognizer. We chose the template recognition system
because it is easy to implement and to understand. However, our filter can work
with most standard recognition architectures with some minor modifications
(e.g., see notes in Section 7). The template architecture works by comparing the
unknown input sequence with each gesture pattern. For our case, the unknown
input is passed through a motion adaptation filter associated with each gesture
(see Figure 4 for an overview).

Human motion data is brought into the system by a motion tracking unit and
segmented by searching for long pauses in the motion sequences. The choice of
tracking system is arbitrary, as long as it can generate a continuous sequence of
motion states. For this architecture, the output is distributed in parallel to
N copies of the filter. Each of the filters is custom-tuned for a specific
gesture. The output of the filters is a set of scores that are processed by the
recognition unit. The scores are the squared differences of the internal
unaugmented and augmented filters.
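
A minimal sketch of the scoring and selection step follows. The normalization (dividing by the number of compared elements) and the decision rule (choosing the gesture whose filter reports the smallest score) are assumptions made for illustration, as are the function names; the score is small when the input matches the learned sequence, as described in Section 4.

```python
def filter_score(unaugmented, augmented):
    """Squared difference between the two filter output sequences,
    normalized here by the number of compared elements (an assumption)."""
    total = 0.0
    count = 0
    for xu, xa in zip(unaugmented, augmented):
        for u, a in zip(xu, xa):
            total += (u - a) ** 2
            count += 1
    return total / count if count else 0.0

def recognize(scores):
    """Return the gesture label with the smallest score (assumed rule)."""
    return min(scores, key=scores.get)

score = filter_score([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 3.0]])
label = recognize({"arc": 0.2, "line": 1.5})
```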

Although our filter can accept tracking data from any motion capturing
technology, for purposes of testing we found it convenient to use a magnetic
tracking system. There are obviously more accurate input technologies (e.g.,
acoustic and inertial) and vision systems, but due to occlusion, they do not
guarantee a continuous reliable stream of input.

Fig. 5. Wrist-Trajectory Shapes of the Gesture Datasets used for the Expert User
Experiments

Fig. 6. Overlapping Features Embedded in Gesture Pairs

We capture orientations of the lower arm, upper arm, and torso to retrieve the
required four Euler angles. We estimate angular velocities using time difference
methods. The set of angles and angular velocities makes up a motion state
vector. The sequence of state vectors is sent to the motion state estimation unit.
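
One way to build the motion state vectors from a stream of joint angles is a backward time difference, sketched below. The specific difference scheme, the zero velocity assigned to the first sample and the function name are assumptions; only the general use of time difference methods is taken from the text above.

```python
def motion_states(angles, dt):
    """Turn a sequence of joint-angle samples into state vectors
    [angles, angular velocities] using a backward difference."""
    states = []
    prev = None
    for q in angles:
        if prev is None:
            qd = [0.0] * len(q)          # no earlier sample: assume rest
        else:
            qd = [(a - p) / dt for a, p in zip(q, prev)]
        states.append(list(q) + qd)
        prev = q
    return states

states = motion_states([[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, -0.2, 0.0]], dt=0.1)
```

With four joint angles per sample, each state vector has eight elements, which matches the 8x8 Jacobians of Section 4.2.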

5.2. Selection of a Hard-to-Discriminate Gesture Dataset

Our first step for analyzing the performance of the filter was to select a set of
gestures that are hard to distinguish from each other. The selection criterion
was determined by observing trajectories of the wrist for each gesture. The
trajectories for the gesture dataset we selected for the introductory
experiments are shown in Figure 5. This gesture set has many overlapping
features, as can be seen in Figure 6. Two distinct gestures that have
overlapping motion segments, especially if they start with the same motion
sub-sequence, are more difficult to distinguish than dissimilar nonoverlapping
gestures. A properly tuned EKF bases its initial output more on the input data
than the dynamic model. But when it converges to a stable blending state, the
dynamics of the system take over. If two gestures have similar starting
trajectories and abruptly change after the dynamics become more dominant, the
system will initially fail to discriminate between the two gestures because the
derived dynamics of the system are similar. Eventually the mixture of the two
dissimilar segments of the gestures will influence and change the system
behavior.

For our experiments, we also considered the direction in which the motion was
performed, thus expanding the five basic shapes to ten. We used combinations of
the five basic shapes to generate gesture datasets and test the performance,
generalizability and extensibility of our approach in four of five expert-user
experiments.

5.3.  Filter  Parameters 

Our  filter  requires  a  set  of  parameters  that  must  be  predetermined  and  tuned  for 
individual  gestures.  The  EKF  requires  error  covariance  data  for  the  measurement 
and  control  processes.  The  dynamics  update  requires  measurements  from  the 
user’s  arm.  Each  control  system  requires  a  blending  constant  and  a  learned 
motion  sequence. 


Parameter Determination To compute the measurement error covariance,
we affixed three motion tracking receivers in the user workspace in a stationary
configuration analogous to that of the right arm. We recorded 1000 samples
continuously and computed the error covariance matrix from the
sampled angles and estimated angular velocities. The measurement covariance
matrix  needs  to  be  computed  once  for  a  given  combination  of  hardware  and 
workspace. 
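
As a rough illustration of this step, the following Python sketch estimates an unbiased sample covariance matrix from a set of stationary recordings. It is not the code used in our experiments: the function name `sample_covariance` is hypothetical, and each row of the input is assumed to hold one recorded sample of angles and estimated angular velocities.

```python
def sample_covariance(samples):
    """Unbiased sample covariance of the rows of `samples`.

    `samples` is a list of equal-length lists, one row per recorded sample.
    Returns a dim x dim matrix as a list of lists.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    dim = len(samples[0])
    # Per-component mean over all samples.
    mean = [sum(row[j] for row in samples) / n for j in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for row in samples:
        dev = [row[j] - mean[j] for j in range(dim)]
        # Accumulate the outer product of the deviation with itself.
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += dev[a] * dev[b]
    return [[value / (n - 1) for value in r] for r in cov]
```

With a stationary sensor configuration the resulting matrix reflects sensor noise only, which is why it has to be recomputed only when the hardware or the workspace changes.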

The  control  process  error  is  computed  by  using  the  pre-recorded  gesture  se¬ 
quences.  A  parametric  learned  motion  sequence  for  each  gesture  type  is  selected 
by  determining  the  closest  fitting  trajectory  to  a  normal  trajectory  that  is  com¬ 
puted  from  the  sample  set  of  gestures.  The  error  matrix  is  estimated  using  the 
mean  squared  error  between  the  parametric  learned  motion  sequence  and  the 
rest  of  the  sequences.  The  control  error  needs  to  be  computed  for  every  gesture 
sequence. 
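
The selection and error estimate described above can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from our implementation: all sequences are already resampled to the same length, the "normal" trajectory is the pointwise mean, and only a per-dimension mean squared error (a diagonal error estimate) is computed. All function names are hypothetical.

```python
def mean_trajectory(seqs):
    """Pointwise mean over sequences; seqs[k][t] is a list of state values."""
    n = len(seqs)
    return [[sum(s[t][d] for s in seqs) / n for d in range(len(seqs[0][t]))]
            for t in range(len(seqs[0]))]


def squared_distance(a, b):
    """Sum of squared differences between two equal-length trajectories."""
    return sum((x - y) ** 2 for pa, pb in zip(a, b) for x, y in zip(pa, pb))


def learned_sequence_and_error(seqs):
    """Pick the sequence closest to the mean and estimate the control error.

    Returns (index of the selected sequence, per-dimension MSE between the
    selected sequence and all remaining sequences).
    """
    if len(seqs) < 2:
        raise ValueError("need at least two recorded sequences")
    mean = mean_trajectory(seqs)
    best = min(range(len(seqs)), key=lambda k: squared_distance(seqs[k], mean))
    learned = seqs[best]
    others = [s for k, s in enumerate(seqs) if k != best]
    steps = len(learned)
    count = len(others) * steps
    mse = [sum((s[t][d] - learned[t][d]) ** 2
               for s in others for t in range(steps)) / count
           for d in range(len(learned[0]))]
    return best, mse
```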

Subject  Measurements  Some  of  the  parameters  needed  for  the  filters  are 
taken  from  measurements  of  the  users.  The  filters  require  the  lengths,  radii  and 
masses  of  the  upper  and  lower  arm.  These  parameters  are  obtained  by  combina¬ 
tions  of  two  methods:  direct  measurements  and  estimation  from  anthropometric 
parameters  of  the  human  body.  The  lengths  are  determined  by  either  directly 
measuring  the  distance  between  the  shoulder  and  elbow,  and  elbow  and  wrist, 
or estimating them from the height and sex of the user. Estimations of anthropometric
parameters are made according to the procedure outlined in Hall [29].
The  radii  are  obtained  by  measuring  the  circumferences  of  the  arm  segments  at 
the  midpoint.  The  masses  for  the  arm  segments  are  determined  as  percentages 
of  the  whole  body  mass  for  males  and  females. 
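
The two simplest conversions in this step can be written down directly. The sketch below treats each arm segment as having a circular cross section at its midpoint, so the radius follows from the measured circumference. The mass fraction is left as a required argument because the actual percentages come from the anthropometric tables and are not reproduced here; both function names are hypothetical.

```python
import math


def radius_from_circumference(circumference_cm):
    """Radius of a circular cross section with the given circumference."""
    return circumference_cm / (2.0 * math.pi)


def segment_mass(body_mass_kg, fraction):
    """Segment mass as a fraction of whole body mass.

    `fraction` must be looked up in an anthropometric table for the
    segment and sex in question; no default value is assumed here.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return body_mass_kg * fraction
```

For example, a midpoint circumference of 23 cm corresponds to a radius of roughly 3.66 cm.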

Parameter Tuning To use the EKF, specific parameters have to be
tuned to get desirable guidance in the recognition units. One of the
parameters  that  needs  tuning  is  a  multiplicative  factor  kc  used  to  scale  the 
augmented  filter’s  control  error  covariance.  There  is  one  such  scaling  factor  for 
each  control  error  covariance  matrix.  The  scaling  factor  is  used  to  adjust  the 
level  of  “trust”  in  the  filter  by  changing  the  control  error  with  respect  to  the 
measurement  error.  The  larger  kc  is,  the  more  the  filter  output  depends  on  the 
input.  The  smaller  kc  is,  the  more  the  filter  output  depends  on  the  controller  and 
dynamic  model.  As  a  result  the  Kalman  gain  matrix,  essential  for  the  Kalman 
blend,  changes.  A  similar  single  parameter  is  adjusted  for  the  unaugmented  filter. 
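
The effect of kc on the gain can be seen in a much simpler setting than our filter. The sketch below uses a scalar linear Kalman filter with identity dynamics, not the EKF and arm model described in this paper; the noise values `q` and `r` are placeholders and the function name is hypothetical. It iterates the covariance recursion until the gain settles.

```python
def steady_state_gain(kc, q=1.0, r=1.0, steps=200):
    """Steady-state gain of a scalar Kalman filter with identity dynamics.

    kc scales the process (control) noise variance q relative to the
    measurement noise variance r.
    """
    p = 1.0   # initial error variance (arbitrary starting point)
    k = 0.0
    for _ in range(steps):
        p_pred = p + kc * q          # predict with scaled process noise
        k = p_pred / (p_pred + r)    # gain: weight given to the measurement
        p = (1.0 - k) * p_pred       # update the error variance
    return k
```

In this toy setting a small kc gives a gain near 0, so the output follows the model, and a large kc gives a gain near 1, so the output follows the measured input, which mirrors the behavior described above.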

Another  parameter  to  be  tuned  is  the  blending  factor  b.  This  is  applied  in  the 
blending  function,  which  performs  a  blend  of  the  intermediate  state  vector 
and  the  parametric  learned  motion  sequence.  This  factor  is  important  because 
it  weights  how  the  raw  data  is  blended  with  the  parametric  learned  motion 
sequence.  The  Kalman  blend  does  not  directly  incorporate  knowledge  of  the 
parametric  learned  motion  sequence.  We  used  one  blending  factor  for  all  the 
gesture types. More details about the choice and tuning of these parameters
are given in our technical report [4].
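
To make the role of b concrete, the sketch below assumes the simplest possible blending function, a linear interpolation between the intermediate state vector and the corresponding sample of the parametric learned motion sequence. This form and the convention that b = 0 leaves the state unchanged are assumptions made for illustration; the blending function actually used is described in the technical report.

```python
def blend(state, learned, b):
    """Linear blend of a state vector with a learned-sequence sample.

    b = 0 returns `state` unchanged, b = 1 returns `learned`.
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    if len(state) != len(learned):
        raise ValueError("vectors must have the same length")
    return [(1.0 - b) * s + b * l for s, l in zip(state, learned)]
```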

An  important  consideration  when  selecting  the  parameters  is  the  degree  of 
alignment of the input gesture with respect to the learned gesture. In the experiments,
we ask the users to extend their right arm perpendicular to the chest.
The  gestures  they  are  asked  to  perform  are  then  roughly  centered  around  that 
hand  position.  Rough  alignment  and  scaling  is  applied  to  the  parametric  learned 
gesture  in  addition  to  the  parameterizing  that  is  necessary  to  perform  a  match¬ 
ing  comparison.  This  is  the  registration  phase,  which  can  be  seen  on  the  right 
side  of  the  filter  diagram  in  Figure  1.  If  the  parametric  learned  gesture  does  not 
align  very  well  with  the  gesture  it  is  supposed  to  accept,  it  creates  a  high  score 
for  the  comparison.  This  is  due  to  our  method  for  evaluation  which  compares 
the augmented and raw input trajectories. If the alignment is extremely bad, we
cannot adjust the kc parameter to “trust” the model as much. In most cases
this is not a problem, but for a dataset that is difficult to recognize, such as the
five basic gestures in Figure 5, some gestures will be improperly classified.
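
A rough alignment and scaling step of this kind might look like the following sketch. It assumes 2D wrist points, alignment of centroids, and a single uniform scale factor derived from the largest distance to the centroid. These choices and the function name are illustrative assumptions, not a description of the registration used in our system.

```python
import math


def register_template(template, observed):
    """Translate and uniformly scale `template` toward `observed`.

    Both arguments are lists of (x, y) points. The result has the same
    number of points as `template`.
    """
    def centroid(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def extent(points, center):
        # Largest distance of any point from the centroid.
        return max(math.hypot(p[0] - center[0], p[1] - center[1])
                   for p in points)

    c_t, c_o = centroid(template), centroid(observed)
    e_t, e_o = extent(template, c_t), extent(observed, c_o)
    scale = e_o / e_t if e_t > 0 else 1.0  # degenerate template: no scaling
    return [(c_o[0] + scale * (x - c_t[0]), c_o[1] + scale * (y - c_t[1]))
            for x, y in template]
```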

Sensitivity  Analysis  If  we  were  to  run  a  full  user  study  on  human  subjects  of 
widely  varying  mass  and  height,  it  would  be  important  to  understand  how  much 
of  an  impact  parameter  changes  have  on  the  dynamics  of  the  system.  If  it  can  be 
shown  that  the  system  is  relatively  insensitive  to  changes  in  the  parameters  then 
it  may  be  considered  to  be  more  generalizable  and  potentially  more  powerful. 
We  analyzed  the  sensitivity  of  a  few  of  the  body  parameters  (summarized  in 
Schmidt [32]), but did not obtain enough meaningful information to draw
conclusions about the generalizability of our filter.

5.4.  Expert  User  Experiments 

We  set  out  to  verify  the  effectiveness  of  the  filter  integrated  into  a  gesture  rec¬ 
ognizer  by  devising  a  set  of  experiments  to  be  performed  by  an  expert  user. 
These were designed to test the performance of the recognizer with and without
our filter. We also wanted to learn how generalizable
and extensible our filter is with respect to different and larger gesture datasets.
To  accomplish  these  goals,  we  ran  five  experiments.  Before  beginning,  we  pre¬ 
recorded  a  database  of  gestures  from  the  user,  computed  the  parameters  and 
learned  models,  and  performed  manual  parameter  tuning. 

Accuracy  Performance  The  purpose  of  the  first  experiment  was  to  determine 
the  performance  rating  of  the  recognizer  integrated  with  and  without  our  filter. 
We  used  the  five  gestures  from  Table  1,  and  recorded  100  samples  for  each 
gesture.  The  gestures  were  first  aligned  with  the  learned  motion  sequences,  then 
the  learned  motion  sequences  were  parameterized  to  match  the  size  of  the  input 
sequence.  We  supplied  both  the  filtered  (our  method)  and  unfiltered  recognizers 
with  the  500  gestures.  The  results  are  given  in  Table  1. 

They  show  that  both  methods  have  an  accuracy  rating  of  99.4%.  The  fact  that 
both  methods  produced  acceptable  results  turned  out  to  be  only  coincidental 
for  the  unfiltered  approach,  which  was  later  shown  to  be  very  inconsistent.  We 
analyzed  this  dataset  further  and  noticed  that  the  gestures  were  fairly  spatially 
regular  with  respect  to  each  other.  For  example,  there  was  not  an  extensive 
amount  of  variation  due  to  alignment,  skewing  and  scaling  among  the  like  ges¬ 
tures  in  this  set. 


Table  1.  Results  of  Experiment  #1 
Fig. 7. Arm Model Motion in Time


To  get  a  better  idea  of  how  our  method  works,  refer  back  to  Figure  2.  The 
arc  in  the  arc  module  shows  the  best  match  between  the  augmented  and  the 
unaugmented  (effectively  the  learned  motion  sequence)  trajectories.  The  rest  of 
the  cases  show  that  the  learned  arc  sequence  has  a  large  influence  on  the  data 
running through the augmented filter, as is evident from the output augmented
trajectories.  This  effect  pulls  the  augmented  and  raw  data  curves  apart.  The 
sequences  in  Figure  7  illustrate  a  small  set  of  state  transitions  from  the  three 
arm  models  used  in  generating  the  trajectories  for  the  line  and  the  arc  in  the 
arc  module.  The  figures  show  frames  from  a  3D  simulation  of  the  corresponding 
schematic  4-DOF  arm  models.  The  arm  states  are  very  similar  for  the  arc  in  the 
arc  module,  but  very  different  for  the  line  in  the  arc  module. 

Generalizability  To  test  the  generalizability  of  our  approach,  we  ran  a  second 
experiment. In the experiment we used the reverse-order wrist trajectories from
the gestures used in the first experiment (a completely different dataset). We
recorded 100 samples for each of the five gestures and purposely added noise
into the samples to test the robustness of our filter. Then we passed them into
the  gesture  recognizer  twice,  with  and  without  our  filter  in  the  system.  The 
resulting  performance  ratings  are  given  in  Table  2. 


Table 2. Results of Experiment #2

Approach      Arc           Line            Wave            Circle          Angle         Totals
Unfiltered    60/100 (60%)  100/100 (100%)  78/100 (78%)    62/100 (62%)    99/100 (99%)  79.8%
Our Filtered  98/100 (98%)  100/100 (100%)  100/100 (100%)  100/100 (100%)  96/100 (96%)  98.8%

In this case, the accuracy of the recognizer integrated with our filter proved to
be far superior to its accuracy without the filter. The performance rating for our filtered approach
is  98.8%,  while  the  unfiltered  is  79.8%. 
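
The overall ratings follow directly from the per-gesture counts in Table 2, as the short check below shows. The counts are read from the table in the order arc, line, wave, circle, angle; the function name is hypothetical.

```python
def overall_accuracy(correct_counts, trials_per_gesture):
    """Overall accuracy in percent from per-gesture correct counts."""
    total_trials = trials_per_gesture * len(correct_counts)
    return 100.0 * sum(correct_counts) / total_trials


# Correct recognitions out of 100 trials per gesture (Table 2).
unfiltered = [60, 100, 78, 62, 99]
filtered = [98, 100, 100, 100, 96]

unfiltered_rating = round(overall_accuracy(unfiltered, 100), 1)
filtered_rating = round(overall_accuracy(filtered, 100), 1)
```

Here 399 of 500 gestures give 79.8% for the unfiltered approach and 494 of 500 give 98.8% for the filtered approach.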

Extensibility  For  the  third  experiment,  we  examined  the  extensibility  of  our 
approach.  To  do  this,  we  increased  the  number  of  distinct  gestures  that  the 
recognizer had to distinguish. We used the two sets of gestures from the first
two experiments and combined them into one database. Although diagrams make
the two gesture sets appear similar, the motions that the human subject has
to perform with the arm are totally different. When we performed the same
experimental procedure as before, the results showed that our method has an accuracy
rating of 99.1% while the unfiltered approach has a rating of 89.6%. This gives us
a good indication that our method is extensible to larger gesture datasets.

More  Generalizability  Experiments  At  this  point  we  decided  to  revisit  the 
first  experiment  with  the  hope  of  making  it  more  difficult  to  distinguish  the 
gestures than before. The goal of the fourth experiment was to further demonstrate the
generalizability of our method. To do this, we replaced the line and
the  wave  with  a  triangle  and  another  form  of  the  arc.  The  new  arc  gesture  is 
generated  using  a  bend  at  the  elbow  instead  of  the  straight  arm  motions  used 
for  the  original  arc.  By  our  definition  of  arm  gestures  (i.e.  movements  of  the  arm 
that  may  or  may  not  have  any  meaningful  intent)  and  our  analysis  of  only  the 
“end-effector”  position  of  the  arm  at  the  wrist,  we  do  not  make  any  distinction 
between  the  new  and  old  arc  gesture  since  both  have  identical  wrist  trajectories. 
The  triangle  gesture  resembles  the  angle  gesture  in  the  first  time  steps,  but 
deviates  from  it  near  the  end.  Our  assumption  was  that  this  choice  of  gestures 
would be harder to discriminate. We ran 75 trials for each gesture.

The experimental results show that the new gesture set was somewhat harder for
both methods to recognize. The triangle and bent-arm arc were recognized at rates of 90.7%
and 86.7%, respectively, by the unfiltered approach, and 98.7% and 96.0% by
our approach. Our filtered approach showed an overall accuracy rating of 98.1%
compared with the unfiltered approach’s rating of 95.2%. The results were again
encouraging with regard to our method’s consistency and accuracy, and they suggest that
it generalizes quite well to different gestures.

For our fifth experiment we ran 50 trials for each of five new gestures, each significantly
different from the others. In addition, we chose gestures that are
somewhat natural. The goal of the experiment was to determine if our
method  works  well  with  gestures  that  are  very  easy  to  distinguish  because  they 
are  quite  distinct  and  are  more  natural.  Our  choices  included  the  “zorro”  sign, 
Catholic  cross,  salute,  wave,  and  stop  gestures.  Diagrams  of  the  motions  of  the 
wrist  and  results  of  the  experiment  are  shown  in  Table  3. 

Table 3. Results of Experiment #5

Approach      Zorro         Catholic Cross  Waving        Stop          Salute        Totals
Unfiltered    50/50 (100%)  46/50 (92%)     50/50 (100%)  50/50 (100%)  50/50 (100%)  98.4%
Our Filtered  50/50 (100%)  50/50 (100%)    50/50 (100%)  50/50 (100%)  50/50 (100%)  100%

The  results  show  that  our  method  was  100%  accurate  on  this  gesture  set, 
while  the  unfiltered  approach  achieved  an  accuracy  rating  of  98.4%. 

Discussion In the experiments, we evaluated the accuracy performance, generalizability
and extensibility of our filter when integrated in a recognition system.
We  made  steps  to  ensure  that  it  was  difficult  to  distinguish  among  gestures  by 
carefully  selecting  gesture  datasets  with  overlapping  motion  traits.  When  com¬ 
pared  with  the  recognizer  with  no  filter  attached,  our  method  showed  improved 
recognition  performances.  Our  results  from  the  five  experiments  show  that  our 
method is consistently accurate with rates ranging from 98.1% to 100% and
extends  to  multiple  gesture  datasets.  This  compares  very  favorably  with  the 
unfiltered  method  whose  accuracy  ranged  from  79.8%  to  99.4%. 

6.  Pilot  Study 

We  performed  a  pilot  study  involving  six  different  subjects,  in  order  to  evaluate 
our  model-based  approach  across  different  subjects. 

6.1.  Subject  Selection 

For  the  experiment,  we  selected  three  males  and  three  females,  with  varying 
anatomical proportions. We included both sexes to account for
potentially differing mass distributions in the arm between male and female subjects,
based on muscle and bone proportions. The proportions we were concerned
with  were  the  lengths,  radii  and  masses  of  the  upper  and  lower  right  arm.  The 
subjects  were  selected  without  regard  to  ethnicity,  age,  social  or  cultural  back¬ 
grounds.  The  only  screening  requirement  we  had  was  a  visual  observation  of  size 
proportions  in  order  to  assure  a  subject  pool  of  varying  anatomical  proportions. 


Fig.  8.  Comparison  of  the  Unfiltered  and  Filtered  Approaches 

6.2.  Subject  Measurements 

The  subjects  had  body  weights  ranging  from  55  to  87  kg  and  heights  ranging 
from 1.6 to 1.9 m, giving us a broad spectrum of masses and lengths for the
subjects’ arms. The upper arm lengths varied from 28 to 36 cm and the
lower  arm  lengths  from  23  to  27  cm.  The  upper  arm  radii  varied  from  3.66  to 
5.25  cm  and  the  lower  arm  radii  from  3.18  to  4.38  cm.  Trackers  were  attached 
using  velcro  straps  at  the  wrist  and  near  the  elbow.  A  third  was  affixed  with 
tape  to  the  shoulder. 

6.3.  Pilot  Experiment 

In the first subject experiment our goal was to compare the recognition process
augmented with a model against the same process without augmentation.
The subjects were asked to perform 25 trials of each of five different
gestures,  using  the  right  arm.  In  between  each  set  of  trials  for  one  gesture,  the 
subject  was  given  ample  rest  time  to  help  avert  any  fatigue  associated  with  the 
repetitive  motions  they  were  asked  to  make.  We  used  the  same  five  gestures  as 
illustrated  in  Table  3,  the  zorro,  Catholic  cross,  stop,  salute  and  waving  gestures. 

The  results  we  obtained  were  measurements  of  how  well  each  recognition  sys¬ 
tem  predicted  the  correct  gesture  sequence.  The  performance  rating  for  the  two 
methods — the  unaugmented  and  our  model-based  approach — were  computed  by 
averaging  the  performances  for  each  of  five  different  gestures.  The  performance 
for  each  gesture  was  computed  by  averaging  the  results  from  each  of  the  six 
subjects.  The  histogram  chart  shown  in  Figure  8  compares  the  two  sets  of  data. 
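
The two-level averaging described above can be sketched as follows. The function name and the input layout (a mapping from gesture name to a list of per-subject recognition rates in percent) are assumptions made for this illustration, and no data from the study is embedded in the code.

```python
from statistics import mean


def performance_rating(rates):
    """Two-level average of recognition rates.

    `rates[gesture]` is a list with one recognition rate (percent) per
    subject. Returns (per-gesture averages, overall rating), where the
    overall rating is the mean of the per-gesture averages.
    """
    per_gesture = {gesture: mean(values) for gesture, values in rates.items()}
    overall = mean(list(per_gesture.values()))
    return per_gesture, overall
```

Because every gesture has the same number of subjects and trials in the pilot study, this two-level average equals the plain average over all trials.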

The  data  for  each  user  was  analyzed  by  setting  the  body  parameters  for  the 
recognizer  to  their  measurements  before  running  the  accuracy  tests.  The  rest 
of  the  parameters  for  the  recognizer  were  individually  tuned  for  each  subject. 
The  results  for  our  model-based  approach  show  an  overall  acceptance  rate  of 
98.7%  with  standard  deviation  of  1.0%.  The  unaugmented  approach  performed 
at  93.5%  acceptance  rate  with  standard  deviation  of  3.7%.  The  high  acceptance 
rate  and  low  variability  that  our  results  show  give  us  a  fairly  good  indication  that 
integrating  our  filter  into  the  recognition  process  improves  recognition  accuracy. 


A  drawback  of  this  experiment  is  that  a  significant  amount  of  custom  param¬ 
eter  tuning  was  required  for  each  subject.  As  a  result,  we  decided  to  evaluate 
whether  or  not  our  methodology  would  allow  us  to  reduce  the  tuning  effort  re¬ 
quired  by  each  experiment.  We  ran  a  set  of  followup  experiments  to  test  these 
ideas.  The  results  were  somewhat  limited.  More  details  can  be  found  in  our 
technical  report  [4]. 

7.  Discussion  and  Conclusions 

We  have  developed  a  new  model-based  filter  that  incorporates  a  dynamics  model, 
a  control  system  and  motion  state  estimation  and  applied  it  to  the  gesture  recog¬ 
nition  process.  The  dynamic  model  gives  us  a  way  to  represent  the  underlying 
mechanical  motion  of  the  human  arm.  The  control  system  acts  as  a  means  to 
exert  control  over  and  provide  guidance  for  the  motion  applied  by  the  dynamics. 

Our  filter  proved  to  be  effective  in  improving  the  performance  of  the  recogni¬ 
tion  process  as  shown  by  our  expert-user  and  pilot  user  studies.  We  showed  this 
by  comparing  an  unfiltered  recognition  process  with  one  augmented  with  our 
model-based  filter.  Our  method  works  acceptably  well  for  hard-to-distinguish 
gesture  sets  and  even  better  for  very  dissimilar  sets.  The  results  definitely  war¬ 
rant  further  user  evaluation  studies. 

Our  method  does  involve  a  small  amount  of  parameter  tuning  and  training 
for  the  error  covariances.  A  lot  of  the  tuning  is  associated  with  the  registration 
of  the  input  and  learned  gestures.  Obviously,  if  the  registration  problem  can  be 
solved,  a  lot  of  the  tuning  can  be  eliminated.  It  also  might  be  the  case  that  more 
sophisticated  models  for  the  human  motion  or  a  more  extensive  model  of  the 
human  body  would  reduce  the  need  for  some  of  the  parameters. 

One  issue  that  our  work  did  not  address  is  the  differences  that  may  occur 
with  people  tracing  the  same  “end-effector”  path  with  different  arm  and  joint 
configurations. For example, the “bent-arm” arc used in the fourth expert-user
experiment has the same wrist trajectory as the “straight-arm” arc used in
the first experiment. We analyzed only the wrist trajectories, although we could
have additionally analyzed either the elbow or joint configuration trajectories.
Doing so would in effect increase the size of the gesture alphabet.

We  only  tested  our  filter  with  a  template  recognition  architecture.  However, 
we  feel  that  it  can  be  easily  modified  for  use  with  a  neural  network  recognizer.  By 
removing  the  unaugmented  sub-filter  component  the  only  output  would  be  the 
augmented filtered sequence. If we set up n filters so that each input to the system
produces n output sequences from the filters (for n distinct gesture patterns),
each of these outputs will differ from the others but be fairly characteristic of
each given input pattern. Extracting m features from each output sequence
would then yield m × n different features for the neural network. If desired, more
features could be added from the raw or unaugmented filtered input sequence.
The rest should proceed as for any neural network recognizer. The advantage of this
(untested) setup would be that the filter could be used to generate many more
unique discriminating features. While this is not always an advantage, if the
features are good discriminating ones we believe the discriminator should be
more powerful.
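
The feature construction proposed above can be sketched in a few lines. Like the setup itself, this sketch is untested against real data: the filters are represented by arbitrary callables standing in for the n gesture-specific augmented filters, and `extract` is a hypothetical function that returns m features for one filtered sequence.

```python
def build_feature_vector(sequence, filters, extract):
    """Concatenate features from the output of each gesture-specific filter.

    `filters` is a list of n callables mapping an input sequence to a
    filtered sequence; `extract` maps a filtered sequence to a list of m
    features. The result has m * n entries.
    """
    features = []
    for gesture_filter in filters:
        filtered = gesture_filter(sequence)
        features.extend(extract(filtered))
    return features
```

The resulting vector would serve as the input layer of the network, with any additional raw or unaugmented features appended in the same way.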

Based  on  our  evaluation  studies,  we  can  conclude  that  our  motion  adaptation 
filter  makes  a  positive  contribution  to  the  performance  of  gesture  recognition  for 
arm-based  gestures.  This  seems  to  imply  that  a  model  of  human  performance 
can  be  used  to  eliminate  some  of  the  heuristic  guess-work  that  must  be  done  to 
make  a  standard  gesture  recognizer  work. 